Bayesian smoothing of confidential data values at organization level using peer organization group

ABSTRACT

In an example embodiment, submitted confidential data of a certain cohort (e.g., title, region, organization) is augmented by modeling confidential data of a more generalized cohort based on peer organizations. The modeling may be performed using Bayesian modeling and the results used to infer confidential data values for the original cohort. The inferred confidential data values can then be used to generate statistical insights for display in a graphical user interface.

TECHNICAL FIELD

The present disclosure generally relates to computer technology for solving technical challenges in collection and maintenance of confidential data in a computer system. More specifically, the present disclosure relates to predicting confidential data value insights at an organization level using a peer organization group.

BACKGROUND

In various types of computer systems, there may be a need to collect, maintain, and utilize confidential data. In some instances, users may be reluctant to share this confidential information due to privacy concerns. These concerns extend not only to pure security concerns, such as concerns over whether third parties such as hackers may gain access to the confidential data, but also to how the computer system itself may utilize the confidential data. With certain types of data, users providing the data may be somewhat comfortable with uses of the data that maintain anonymity, such as the confidential data merely being used to provide broad statistical analysis to other users.

One example of such confidential data is salary/compensation information. It may be desirable for a service such as an online service to request its members to provide information about their salary or other work-related compensation in order to provide members with insights into various metrics regarding salary/compensation, such as an average salary for a particular job type in a particular city. There are technical challenges encountered, however, in ensuring that such confidential information remains confidential and is only used for specific purposes, and it can be difficult to convince members to provide such confidential information due to their concerns that these technical challenges may not be met.

Additionally, viewers of reports related to the confidential data (such as aggregated salary statistics) are often interested in gaining insights on the confidential data at the organization (e.g., company) level. This, however, can be challenging in environments where there is not enough confidential data at the organization level to provide meaningful insights, such as in the case of small or medium-sized organizations or in cohorts where a large organization only has a small presence in a location or title of interest (e.g., Microsoft jobs in Indianapolis or jobs as a construction worker for Google).

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments of the technology are illustrated, by way of example and not limitation, in the figures of the accompanying drawings.

FIG. 1 is a block diagram illustrating a confidential data collection, tracking, and usage system, in accordance with an example embodiment.

FIGS. 2A-2C are screen captures illustrating an example of a user interface provided by a confidential data frontend, in accordance with an example embodiment.

FIG. 3 is a flow diagram illustrating a method for confidential data collection and storage, in accordance with an example embodiment.

FIG. 4 is a diagram illustrating an example of a submission table, in accordance with an example embodiment.

FIG. 5 is a flow diagram illustrating a method for confidential data collection and storage, in accordance with an example embodiment.

FIG. 6 is a diagram illustrating an example of a first submission table and a second submission table, in accordance with an example embodiment.

FIG. 7 is a flow diagram illustrating a method in accordance with an example embodiment.

FIG. 8 is a screen capture illustrating an organization confidential data social networking profile page in accordance with an example embodiment.

FIG. 9 is a flow diagram illustrating another method in accordance with an example embodiment.

FIG. 10 is a screen capture of a graphical user interface in accordance with an example embodiment.

FIG. 11 is a block diagram illustrating a representative software architecture, which may be used in conjunction with various hardware architectures herein described.

FIG. 12 is a block diagram illustrating components of a machine, according to some example embodiments, able to read instructions from a machine-readable medium (e.g., a machine-readable storage medium) and perform any one or more of the methodologies discussed herein.

DETAILED DESCRIPTION

The present disclosure describes, among other things, methods, systems, and computer program products. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various aspects of different embodiments of the present disclosure. It will be evident, however, to one skilled in the art, that the present disclosure may be practiced without all of the specific details.

In an example embodiment, an architecture is provided that gathers confidential information from users, tracks the submission of the confidential information, and maintains and utilizes the confidential information in a secure manner while ensuring that the confidential information is accurate and reliable.

Furthermore, in an example embodiment, the architecture is extended by providing components for reliably inferring confidential data insights (e.g., average salary) for cohorts with little or no actual submitted confidential data. This is performed by inferring confidential data values based on organizations that are considered to be peers to the organization of interest. This solution involves two parts: the generation of a novel, semantic representation (embedding) of organizations to be used to compute a similarity measure between any two organizations, and the use of the semantic representation to compute a peer organization group of a given company.

Additionally, in some example embodiments, Bayesian smoothing is used on confidential data for peer organizations of an organization of interest, in order to correct for a lack of available pieces of confidential data for the organization of interest itself and/or prevent detection of individual confidential data values by users.

FIG. 1 is a block diagram illustrating a confidential data collection, tracking, and usage system 100, in accordance with an example embodiment. A client device 102 may utilize a confidential data frontend 104 to submit confidential information to a confidential data backend 106. In some example embodiments, the confidential data backend 106 is located on a server-side or cloud platform 107 while the confidential data frontend 104 is directly connected to or embedded in the client device 102. However, in some example embodiments, the confidential data frontend 104 is also located on the server-side or cloud platform 107.

There may be various different potential implementations of the confidential data frontend 104, depending upon the type and configuration of the client device 102. In an example embodiment, the confidential data frontend 104 may be a web page that is served to a web browser operating on the client device 102. The web page may include various scripts, such as JavaScript code, in addition to Hypertext Markup Language (HTML) and Cascading Style Sheets (CSS) code designed to perform various tasks that will be described in more detail below. The web page may be served in response to the user selecting a link in a previous communication or web page. For example, the link may be displayed in an email communication to the user or as part of a feed section of the user's online service member page. This allows the entity operating the confidential data collection, tracking, and usage system 100 to selectively target users to request that they submit confidential information. For example, the entity may determine that there is a need to obtain more salary information for users from Kansas and then may send out communications to, or cause the online service to alter feeds of, users in a manner that allows the users to select the link to launch the confidential data frontend 104.

In another example embodiment, the confidential data frontend 104 may be built into an application installed on the client device 102, such as a standalone application running on a smartphone. Again, this confidential data frontend 104 is designed to perform various tasks that will be described in more detail below.

One task that the confidential data frontend 104 may be designed to perform is the gathering of confidential data from a user of the client device 102. Another task that the confidential data frontend 104 may be designed to perform is to display insights from confidential data contributed by other users. In order to incentivize users to provide certain types of confidential data, in an example embodiment, insights from the confidential data contributed by other users are provided in response to the user contributing his or her own confidential data. As will be described in more detail, a mechanism to ensure that the contribution of confidential data is tracked is provided.

Once the confidential data is received from the user, the confidential data frontend 104 may transmit the confidential data along with an identification of the user (such as a member identification reflecting the user's account with an online service) to the confidential data backend 106. In an example embodiment, this may be performed via, for example, a REST Application Program Interface (API).

The confidential data, along with the identification of the user, may be stored in a submission table by the confidential data backend 106 in a confidential information database 108. In some example embodiments, this submission table may be encrypted in order to ensure security of the information in the submission table. Furthermore, in some example embodiments, the confidential data stored in the submission table may be encrypted using a different key than the identifying information in the submission table. This encryption will be described in more detail below.

In another example embodiment, a random transaction number is generated for each confidential data submission. This random transaction number is stored with the identifying information in one table, and then stored with the confidential data in another table, with each table encrypted separately using a different key. In either this example embodiment or the previous example embodiment, encrypting the identifying information separately from the confidential data (either in one table or in separate tables) provides added security against the possibility that a malicious user could gain access to one or the other. In other words, even if a malicious user gained access to the identifying information by, for example, hacking the encryption used to encrypt the identifying information, that would not allow the malicious user to gain access to the confidential data corresponding to the identifying information, and vice versa. In an example embodiment, the encryption mechanism used is one that is non-deterministic, such that the same information encrypted twice would produce different results in each encryption. In another example embodiment, the transaction number itself is also encrypted, thereby preventing even the act of joining separate tables containing the identifying information and the confidential data.

In an example embodiment, a submission table may also be able to track when submissions were made by users. As such, the submission table may include additional columns, such as, for example, a submission identification, an identification of the user who made the submission, an encryption key for the submission, and timestamp information about when the submission was made. The submission table may then be utilized by the confidential data backend 106 to determine, for example, when to share insights from submissions from other users to a particular user. If, for example, the user has previously submitted confidential data and has done so recently (e.g., within the last year), then the confidential data backend 106 may indicate to the confidential data frontend 104 that it should share insights from confidential data from other users with this particular user.
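
By way of illustration only, and not limitation, the recency-based eligibility determination described above might be sketched as follows. The function name, the 365-day window, and the sample timestamps are assumptions introduced for this sketch and are not part of the described system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical recency window; the description gives "within the last year"
# only as an example, so this value is an assumption.
RECENCY_WINDOW = timedelta(days=365)


def is_eligible_for_insights(submission_times, now, window=RECENCY_WINDOW):
    """Return True if any submission timestamp falls within the window before `now`."""
    return any(timedelta(0) <= now - ts <= window for ts in submission_times)


# Example timestamps, as might be read from the submission table.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
recent_submissions = [datetime(2024, 1, 15, tzinfo=timezone.utc)]
stale_submissions = [datetime(2022, 3, 1, tzinfo=timezone.utc)]
```

A user with no submissions at all yields an empty list and is therefore treated as ineligible in this sketch.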

There may be other methods than those described above for determining eligibility of a user for receiving insights from submissions from other users. For example, a predicate expressed in terms of one or more attributes may need to be satisfied in order to receive the insights, such as particular demographic or profile-based attributes. These attributes can include any such attribute, from location to title, to level of skill, to online service activities or status (e.g., about to transition from being an active member to an inactive member), to transactional attributes (e.g., purchased a premium subscription).

Additionally, any combination of the above factors can be used to determine whether the user is eligible for receiving insights from submissions from other users.

Furthermore, the submission table may also include one or more attributes of the user that made the submission. These attributes may be attributes that can be useful in determining a slice to which the user belongs. Slices will be described in more detail below, but generally involve a segment of users sharing common attributes, such as titles, locations, educational levels, and the like. It should be noted that it is not necessary for these attributes to be stored in the submission table. Since an identification of the user is available in the submission table, it may be possible to retrieve the attributes for the user on an as-needed basis, such as by querying an online service with the user identification when needed.

A databus listener 110 detects when new confidential data is added to the confidential information database 108 and triggers a workflow to handle the new confidential data. First, the databus listener 110 queries a thresholds data store 116 to determine if one or more thresholds for anonymization have been met. Specifically, until a certain number of data points for confidential data have been met, the confidential data collection, tracking, and usage system 100 will not act upon any particular confidential data point. As will be described in more detail later, these thresholds may be created on a per-slice basis. Each slice may define a segment of users about which insights may be gathered based on data points from confidential data submitted by users in the slice. For example, one slice may be users with the title “software engineer” located in the “San Francisco Bay Area.” If, for example, the confidential data is compensation information, then it may be determined that in order to gain useful insights into the compensation information for a particular title in a particular region, at least ten data points (e.g., compensation information of ten different users) are needed. In this case, the threshold for “software engineer” located in “San Francisco Bay Area” may be set at ten. The databus listener 110, therefore, is designed to retrieve the confidential data added to the confidential information database 108, retrieve the threshold for the slice corresponding to attributes of the user (as stored, for example, in the submission table in the confidential information database 108 or retrieved at runtime from an online service), determine if the new data point(s) cause the threshold for the corresponding slice to be exceeded, and, if so, or if the threshold had already been exceeded, insert the data in a backend queue 112 for extract, transform, and load (ETL) functions.

In an example embodiment, the thresholds data store 116 contains not just the thresholds themselves but also a running count of how many data points have been received for each slice. In other words, the thresholds data store 116 indicates how close the slice is to having enough data points with which to provide insights. The databus listener 110 may reference these counts when making its determination that a newly submitted data point causes a threshold to be exceeded. Running counts of data points received for each slice are updated in the thresholds data store 116 by the confidential data backend 106.

Since the databus listener 110 only transfers data points for a particular slice to the backend queue 112 once the threshold for that slice has been exceeded, the confidential data points corresponding to that slice may need to be retrieved from the confidential information database 108 once the threshold is determined to be exceeded. For example, if, as above, the threshold for a particular slice is ten data points, the first nine data points received for that slice may simply be left in the confidential information database 108 and not sent to the backend queue 112. Then, when the tenth data point for the slice is stored in the confidential information database 108, the databus listener 110 may determine that the threshold has been exceeded and retrieve all ten data points for the slice from the confidential information database 108 and send them to the backend queue 112 for processing.
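
By way of illustration only, the per-slice threshold gating described above might be sketched as follows. The class name `SliceThresholdGate`, its attributes, and the in-memory lists standing in for the confidential information database 108, the thresholds data store 116, and the backend queue 112 are assumptions made for this sketch. Consistent with the ten-data-point example, a threshold is treated as exceeded once the count reaches the threshold.

```python
from collections import defaultdict


class SliceThresholdGate:
    """Holds data points per slice until the slice's threshold is reached."""

    def __init__(self, thresholds):
        self.thresholds = thresholds      # slice key -> required number of data points
        self.held = defaultdict(list)     # data points still waiting in the database
        self.released = set()             # slices whose threshold was already reached
        self.queue = []                   # stand-in for the backend queue

    def on_new_data_point(self, slice_key, value):
        # Threshold already reached earlier: forward the new point immediately.
        if slice_key in self.released:
            self.queue.append((slice_key, value))
            return
        self.held[slice_key].append(value)
        # Threshold reached now: release every held point for the slice at once.
        if len(self.held[slice_key]) >= self.thresholds[slice_key]:
            for held_value in self.held.pop(slice_key):
                self.queue.append((slice_key, held_value))
            self.released.add(slice_key)


# Example with a small threshold of three data points for one slice.
slice_key = ("software engineer", "San Francisco Bay Area")
gate = SliceThresholdGate({slice_key: 3})
gate.on_new_data_point(slice_key, 150000)
gate.on_new_data_point(slice_key, 160000)
queued_before_threshold = len(gate.queue)
gate.on_new_data_point(slice_key, 170000)
queued_at_threshold = len(gate.queue)
gate.on_new_data_point(slice_key, 180000)
queued_after_threshold = len(gate.queue)
```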

It should be noted that the information obtained by the databus listener 110 from the confidential information database 108 and placed in the backend queue 112 is deidentified. In an example embodiment, no identification of the users who submitted the confidential data is provided to the backend queue 112. Indeed, in some example embodiments, the information provided to the backend queue 112 may simply be the confidential data itself and any information needed in order to properly group the confidential data into one or more slices. For example, if slices are designed to group user confidential data based only on user title, location, and years of experience, other attributes for the user that might have been stored in the confidential information database 108, such as schools attended, may not be transferred to the backend queue 112 when the confidential data tied to those attributes is transferred to the backend queue 112. This further helps to anonymize the data, as it makes it more difficult for people to be able to deduce the identity of a user based on his or her attributes.
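
By way of illustration only, the deidentification step described above can be viewed as a projection that keeps the confidential data and only those attributes needed for slicing. The field names and the `deidentify` helper below are hypothetical and chosen to mirror the title, location, and years-of-experience example.

```python
# Hypothetical set of attributes that the slices are defined over.
SLICE_ATTRIBUTES = ("title", "location", "years_of_experience")


def deidentify(record, slice_attributes=SLICE_ATTRIBUTES):
    """Keep only the confidential data and the attributes needed to form slices."""
    projected = {"confidential_data": record["confidential_data"]}
    for name in slice_attributes:
        if name in record:
            projected[name] = record[name]
    return projected


# Example record; the user identification and schools attended are dropped.
stored_record = {
    "user_id": "member-42",
    "confidential_data": 120000,
    "title": "software engineer",
    "location": "San Francisco Bay Area",
    "years_of_experience": 5,
    "schools_attended": ["Stanford University"],
}
queued_record = deidentify(stored_record)
```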

It should also be noted that any one piece of confidential data may correspond to multiple different slices, and thus the databus listener 110 may, in some example embodiments, provide the same confidential data to the backend queue 112 multiple times. This can occur at different times as well, because each of the slices may have its own threshold that may be transgressed at different times based on different counts. Thus, for example, compensation data for a user in the “San Francisco Bay Area” with a job title of “software developer” and a school attended as “Stanford University” may be appropriately assigned to one slice of software developers in the San Francisco Bay Area, a slice of Stanford University alums, and a slice of software developers in the United States. All slices may have their own thresholds and counts from confidential data from other users, who may or may not have complete overlap with these three slices.
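
By way of illustration only, the assignment of a single piece of confidential data to several slices might be sketched as follows. The slice definitions, attribute names, and the `slices_for` helper are assumptions made for this sketch, chosen to mirror the three slices in the example above.

```python
# Hypothetical slice definitions: slice name -> attributes that define the slice.
SLICE_DEFINITIONS = {
    "title_region": ("title", "region"),
    "school": ("school",),
    "title_country": ("title", "country"),
}


def slices_for(attributes, definitions=SLICE_DEFINITIONS):
    """Return a key for every slice whose defining attributes are all present."""
    keys = []
    for name, fields in definitions.items():
        if all(field in attributes for field in fields):
            keys.append((name, tuple(attributes[field] for field in fields)))
    return keys


user_attributes = {
    "title": "software developer",
    "region": "San Francisco Bay Area",
    "country": "United States",
    "school": "Stanford University",
}
slice_keys = slices_for(user_attributes)
```

Each returned key could then be checked against its own threshold and count, so the same data point may be released for one slice well before it is released for another.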

An ETL backend 114 acts to extract, transform, and load the confidential data to anonymize and group it and place it back in the confidential information database 108 in a different location from where it was stored in non-deidentified form. It should be noted that in some example embodiments, the anonymization described above with respect to the databus listener 110 may actually be performed by the ETL backend 114. For example, the databus listener 110 may send non-deidentified confidential data along with all attributes to the backend queue 112, and it may be the ETL backend 114 that reviews this data and discards certain elements of it to anonymize it.

In an example embodiment, the confidential information is stored in encrypted format in the confidential information database 108 when the databus listener 110 sends it to the backend queue 112. As such, one function of the ETL backend 114 is to decrypt the confidential information. Encryption and decryption of the confidential data will be discussed in more detail below.

The ETL backend 114 writes the deidentified confidential data and slice information into an ETL table corresponding to the slice in the confidential information database 108. As described earlier, this ETL table may be stored in a different location than that in which the confidential data was stored initially, such as the submission table described earlier.

At a later time, and perhaps using a batch or other periodic process, the information from the ETL table may be loaded in a distributed file system (DFS) 118. A confidential data relevance workflow 120 may then extract relevant information from the DFS 118 and provide one or more insights into the relevant information in a confidential data insights data store 122. A confidential data relevance API 124 may then be utilized to provide insights from the confidential data insights data store 122 to the confidential data frontend 104, which can then display them to a user. As described earlier, these insights may be provided only on a “give-to-get” basis, namely that only users who provide confidential information (and/or have provided it recently) can view insights.

Turning now to more detail about the submission process, FIGS. 2A-2C are screen captures illustrating an example of a user interface 200 provided by the confidential data frontend 104, in accordance with an example embodiment. Referring first to FIG. 2A, the user interface 200 here is depicted as a screen of a standalone application operating on a mobile device, such as a smartphone. In FIG. 2A, the user is prompted to enter a base salary in a text box 202, with a drop-down menu providing options for different time periods on which to measure the base salary (e.g., per year, per month, per hour, etc.). Additionally, the user may be identified by name at 204, the user's title may be identified at 206, and the user's current employer may be identified at 208. This information may be pre-populated into the user interface 200, such as by retrieving this information from a member profile for the user in an online service. This eliminates the need for the user to enter this information manually, which can have the effect of dissuading some users from providing the confidential information or completing the submission process, especially on a mobile device where typing or otherwise entering information may be cumbersome.

Turning to FIG. 2B, here the user interface 200 displays a number of other possible compensation types 210-220 from which the user can select. Selecting one of these other possible compensation types 210-220 causes the user interface 200 to provide an additional screen where the user can submit confidential data regarding the selected compensation type 210-220. Here, for example, the user has selected “Stock” 212. Referring now to FIG. 2C, the user interface 200 then switches to this screen, which allows the user to provide various specific details about stock compensation, such as restricted stock unit (RSU) compensation 222 and options 224. The user interface 200 at this stage may also display the other compensation types 210-220 that the user can make additional submissions for.

Referring back to FIG. 2B, when the user has completed entering all the confidential data, such as all the different compensation types appropriate for his or her current job, a “Get insights” button 226 may be selected, which launches a process by which the confidential data backend 106 determines whether the user is eligible to receive insights from confidential data from other users and, if so, indicates to the confidential data frontend 104 that the insights should be provided. Additionally, selection of the “Get insights” button 226 represents an indication that the submission of the confidential data by this user has been completed, causing the confidential data backend 106 to store the confidential data in the confidential information database 108 as described below, which then may trigger the databus listener 110 to extract the confidential information and cause the ETL backend 114 to anonymize the confidential data and place it in the appropriate ETL tables corresponding to the appropriate slices in which the confidential data belongs. This permits the submitted confidential data to be available for future insights.

FIG. 3 is a flow diagram illustrating a method 300 for confidential data collection and storage, in accordance with an example embodiment. In an example embodiment, the method 300 may be performed by the confidential data backend 106 of FIG. 1. At operation 302, confidential data is obtained. At operation 304, an identification of the user who submitted the confidential data is obtained. It should be noted that while operations 302 and 304 are listed separately, they may be performed in the same operation in some example embodiments. For example, in an example embodiment, the confidential data frontend 104 may, upon receiving an indication from a user that input of confidential data in the confidential data frontend 104 by the user has been completed, forward the inputted confidential data and an identification of the user to the confidential data backend 106. In other example embodiments, however, the operations 302 and 304 may be performed separately. For example, in an example embodiment, the identification of the user may not be obtained directly from the confidential data frontend 104, but rather some other type of identifying information may be obtained directly from the confidential data frontend 104, and this other type of identifying information may be used to query an online service or other third-party service for the identification information for the user. Regardless, after operations 302 and 304 have been performed, the confidential data backend 106 has at its disposal some confidential data and identification information for the user who entered the confidential data.

It should be noted that the confidential data may be a single piece of information, or may be multiple, related pieces of information. For example, the confidential data may simply include a total compensation value and nothing more, or may include a complete breakdown of different types of compensation (e.g., base salary, bonus, stock, etc.).

Users are understandably concerned about the security of the confidential information, and specifically about a malicious user being able to correlate the confidential information and the identification of the user (i.e., not just learning the confidential information but tying the confidential information specifically to the user). As such, at operation 306, the confidential data is encrypted using a first key and stored in a first column of a submission table in a confidential information database. Then, at operation 308, the identification of the user who submitted the confidential data is separately encrypted using a second key and stored in a second column of the submission table in the confidential information database.

Additionally, a number of optional pieces of information may, in some example embodiments, be stored in the submission table at this point. At operation 310, a timestamp of the submission of the confidential data may be stored in a column in the submission table. This timestamp may be used in, for example, a determination of whether the user is eligible to receive insights from confidential data submitted by other users. At operation 312, one or more attributes of the user may be stored as one or more columns in the submission table. These attributes may be used, for example, in determining to which slice(s) the confidential data may apply, as will be described in more detail below.

FIG. 4 is a diagram illustrating an example of a submission table 400, in accordance with an example embodiment. Each row in the submission table 400 corresponds to a different submission. Here, the submission table 400 includes five columns. In a first column 402, confidential data encrypted by a first key is stored. In a second column 404, identification of the user who submitted the corresponding confidential data, encrypted by a second key, is stored. In a third column 406, a timestamp for the submission is stored. In a fourth column 408, a first attribute of the user, here location, is stored. In a fifth column 410, a second attribute of the user, here title, is stored. Of course, there may be additional columns to store additional attributes or other pieces of information related to the submission.
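
By way of illustration only, a row of a five-column submission table with separately keyed first and second columns might be constructed as sketched below. The disclosure does not name an encryption algorithm; `toy_encrypt` and `toy_decrypt` are a deliberately simple stand-in (a SHA-256 keystream with a random nonce) introduced for this sketch so that it runs with only the standard library, and they are not suitable for protecting real data. All function and field names are assumptions. The random nonce shows one way an encryption can be non-deterministic, in that encrypting the same value twice yields different results.

```python
import hashlib
import secrets

NONCE_LEN = 16


def _keystream(key, nonce, length):
    """Derive `length` pseudo-random bytes from key and nonce (toy construction)."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        blocks.append(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return b"".join(blocks)[:length]


def toy_encrypt(key, plaintext):
    """Illustrative stand-in only: a fresh random nonce makes each result differ."""
    nonce = secrets.token_bytes(NONCE_LEN)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key, blob):
    nonce, body = blob[:NONCE_LEN], blob[NONCE_LEN:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def make_submission_row(data_key, id_key, confidential_data, user_id,
                        timestamp, location, title):
    """Build one row: columns one and two are encrypted under different keys."""
    return {
        "confidential_data": toy_encrypt(data_key, confidential_data.encode()),
        "user_id": toy_encrypt(id_key, user_id.encode()),
        "timestamp": timestamp,
        "location": location,
        "title": title,
    }


data_key = secrets.token_bytes(32)   # first key
id_key = secrets.token_bytes(32)     # second key
row = make_submission_row(data_key, id_key, "base_salary=120000", "member-42",
                          "2024-06-01T12:00:00Z", "San Francisco Bay Area",
                          "software engineer")
```

Because the two columns use different keys, recovering the first key in this sketch reveals the confidential data but not the identity of the user who submitted it, and vice versa.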

Notably, FIG. 4 depicts an example embodiment where only the first and second columns 402, 404 are encrypted, using different encryption keys. In some example embodiments, the additional columns 406-410 may also be encrypted, either individually or together. In some example embodiments, one or more of these additional columns 406-410 may be encrypted using the same key as the first or second column 402, 404. Furthermore, in some example embodiments, the submission table 400 may be additionally encrypted as a whole, using a third encryption key different from the keys used to encrypt the first and second columns 402, 404.

It should be noted that while FIGS. 3 and 4 describe the confidential data as being stored in a single column in a submission table, in some example embodiments, this column is actually multiple columns, or multiple sub-columns, with each corresponding to a subset of the confidential data. For example, if the confidential data is compensation information, the confidential data may actually comprise multiple different pieces of compensation information, such as base salary, bonus, stock, tips, and the like. Each of these pieces of compensation information may, in some example embodiments, have its own column in the submission table. Nevertheless, the processes described herein with regard to the “column” in which the confidential data is stored apply equally to the embodiments where multiple columns are used (e.g., the individual pieces of compensation information are still encrypted separately from the user identification information).

FIG. 5 is a flow diagram illustrating a method 500 for confidential data collection and storage, in accordance with an example embodiment. In contrast with FIG. 3, FIG. 5 represents an example embodiment where the confidential data and the identification of the user who submitted the confidential data are stored in separate tables in order to provide additional security. At operation 502, confidential data is obtained. At operation 504, an identification of the user who submitted the confidential data is obtained. As in FIG. 3, while operations 502 and 504 are listed separately, in some example embodiments they may be performed in the same operation.

At operation 506, a transaction identification is generated. This transaction identification may be, for example, a randomly generated number or character sequence that uniquely identifies the submission. At operation 508, the transaction identification may be encrypted using a first key. At operation 510, the transaction identification (either encrypted or not, depending upon whether operation 508 was utilized) is stored in a first column in a first submission table and in a first column in a second submission table in a confidential information database.

At operation 512, the confidential data is encrypted using a second key and stored in a second column of the first submission table in the confidential information database. Then, at operation 514, the identification of the user who submitted the confidential data is separately encrypted using a third key and stored in a second column of the second submission table in the confidential information database.

Additionally, as in FIG. 3, a number of optional pieces of information may, in some example embodiments, be stored in the first and/or second submission tables at this point. At operation 516, a timestamp of the submission of the confidential data may be stored in a column in the second submission table. This timestamp may be used in, for example, a determination of whether the user is eligible to receive insights from confidential data submitted by other users. At operation 518, one or more attributes of the user may be stored as one or more columns in the second submission table. These attributes may be used, for example, in determining to which slice(s) the confidential data may apply, as will be described in more detail below. It should be noted that while operations 516 and 518 are described as placing information in the second submission table, in other example embodiments, one or more of these pieces of information may be stored in the first submission table.

If operation 508 is utilized, an additional layer of security is provided: the transaction identification is encrypted, and it is the only mechanism by which the confidential data in the first submission table can be linked, through a join operation, with the user identification in the second submission table.
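The two-table layout of operations 506 through 518 can be illustrated with a minimal sketch. Everything named here is hypothetical: the tables are plain Python lists, the key names are invented, and `toy_encrypt` is a deliberately insecure stand-in for whatever cipher a real system would use. The point is only that the encrypted transaction identification is the sole value shared by the two tables.

```python
import hashlib
import secrets


def _keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (toy construction)."""
    blocks = (length + 31) // 32
    stream = b"".join(
        hashlib.sha256(key + i.to_bytes(4, "big")).digest() for i in range(blocks)
    )
    return stream[:length]


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Placeholder cipher, NOT secure: XOR with a key-derived stream.

    XOR is its own inverse, so the same function also decrypts.
    """
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def store_submission(first_table, second_table, keys, user_id, confidential,
                     timestamp, attributes):
    """Store one submission across two tables linked only by a transaction ID."""
    txn_id = secrets.token_hex(16).encode()              # operation 506
    enc_txn = toy_encrypt(keys["first"], txn_id)         # operation 508
    first_table.append({                                 # operations 510, 512
        "txn": enc_txn,
        "data": toy_encrypt(keys["second"], confidential.encode()),
    })
    second_table.append({                                # operations 510, 514-518
        "txn": enc_txn,
        "user": toy_encrypt(keys["third"], user_id.encode()),
        "timestamp": timestamp,
        **attributes,
    })


def join_tables(first_table, second_table):
    """Join rows of the two tables on the (encrypted) transaction ID."""
    by_txn = {row["txn"]: row for row in second_table}
    return [(row, by_txn[row["txn"]]) for row in first_table]


# Illustrative data only.
keys = {name: secrets.token_bytes(32) for name in ("first", "second", "third")}
first, second = [], []
store_submission(first, second, keys, "user-42", "base=100000",
                 "2024-01-01T00:00:00Z",
                 {"location": "SF Bay Area", "title": "Engineer"})
data_row, user_row = join_tables(first, second)[0]
recovered = (toy_encrypt(keys["third"], user_row["user"]).decode(),
             toy_encrypt(keys["second"], data_row["data"]).decode())
```

In this sketch, neither table alone ties a user to a confidential value; only the join on the `txn` column does.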

FIG. 6 is a diagram illustrating an example of a first submission table 600 and a second submission table 602, in accordance with an example embodiment. Each row in each of the first and second submission tables 600, 602 corresponds to a different submission. Here, the first submission table 600 includes two columns. In a first column 604, transaction identification information encrypted by a first key is stored. In a second column 606, confidential data encrypted by a second key is stored.

The second submission table 602 includes five columns. In a first column 608, transaction identification information encrypted by the first key is stored. In a second column 610, identification of the user who submitted the corresponding confidential data, encrypted by a third key, is stored. In a third column 612, a timestamp for the submission is stored. In a fourth column 614, a first attribute of the user (here, location) is stored. In a fifth column 616, a second attribute of the user (here, title) is stored. Of course, there may be additional columns to store additional attributes or other pieces of information related to the submission.

Notably, FIG. 6 depicts an example embodiment where only the first and second columns 608, 610 of the second submission table 602 are encrypted, using different encryption keys. In some example embodiments, the additional columns 612-616 may also be encrypted, either individually or together. Furthermore, in some example embodiments, the first and/or second submission tables 600, 602 may be additionally encrypted as a whole, using an additional encryption key(s) different from the keys described previously.

It should be noted that while FIGS. 5 and 6 describe the confidential data as being stored in a single column in a first submission table, in some example embodiments this column is actually multiple columns, or multiple sub-columns, with each corresponding to a subset of the confidential data. For example, if the confidential data is compensation information, the confidential data may actually comprise multiple different pieces of compensation information, such as base salary, bonus, stock, tips, and the like. Each of these pieces of compensation information may, in some example embodiments, have its own column in the first submission table. Nevertheless, the processes described herein with regard to the “column” in which the confidential data is stored apply equally to the embodiments where multiple columns are used (e.g., the individual pieces of compensation information are still encrypted separately from the user identification information).

As described above, there is a need to handle situations where confidential data for certain combinations of cohorts, such as certain combinations of organizations and locations/job titles, are sparse.

In an example embodiment, information in an online service is mined to generate a novel, semantic representation (embedding) of organizations. Organization embeddings are learned from transition data (i.e., information about users who transitioned from one organization to another). Pairwise similarity values between organizations can then be computed based on these embeddings.

It should be noted that while in one embodiment these pairwise similarity values can be used to infer confidential data values, as described in more detail below, they can also be used in other scenarios, such as to display the set of organizations most similar to a given organization as part of an organization profile in the online service; perform audience expansion in advertising by showing ads not just to members of a particular organization but also to members from similar organizations; generate larger candidate sets for “organization queries” that return insufficient results in user searches, by expanding the result to include similar organizations; and serve as a feature in various search or recommendation systems.

For purposes of this solution, two organizations are considered to be similar if employees are very likely to move from one organization to the other. The notion of a peer score between two companies is formally defined. An algorithm called Company2vec, which uses techniques such as negative sampling and stochastic gradient descent to map each organization to its latent representations, may then be used to learn company embeddings from transition data. From these embeddings, the peer scores may then be computed.

Two organizations u and v are considered peers if organization v is among the top choices for employees in organization u to transition to and vice versa. Similarity between organizations u and v is then measured via a peer score, defined as follows:

$ps(u,v) := \frac{P(c_1 = v \mid c_0 = u)}{\max_{w} P(c_1 = w \mid c_0 = u)} \cdot \frac{P(c_1 = u \mid c_0 = v)}{\max_{w} P(c_1 = w \mid c_0 = v)}$

where c₀ denotes the organization prior to the transition, c₁ denotes the organization after the transition, and P(c₁=v|c₀=u) is thus the probability of a user transitioning to organization v conditioned on the current organization being u. The peer score ps(u,v) has a range of (0, 1] and reaches its maximum of 1 if companies u and v are each other's top transition choice, i.e.,

$v = \arg\max_{w} P(c_1 = w \mid c_0 = u)$ and $u = \arg\max_{w} P(c_1 = w \mid c_0 = v).$

Without loss of generality, P(c₁=v|c₀=u) is henceforth denoted as P(v|u).
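As an illustration of the peer score definition, the following sketch estimates P(v|u) directly from observed transition counts and then forms the normalized product. The empirical-frequency estimate and the function names are assumptions made for this example; the embodiment described here estimates the probabilities from learned embeddings instead. Note that with raw counts the score can be exactly 0 when no transition was observed in one direction.

```python
from collections import Counter


def transition_probs(transitions):
    """Estimate P(v|u) as count(u -> v) / count(u -> any organization)."""
    pair_counts = Counter(transitions)
    origin_counts = Counter(u for u, _ in transitions)
    return {(u, v): n / origin_counts[u] for (u, v), n in pair_counts.items()}


def peer_score(probs, u, v):
    """ps(u, v): product of the two max-normalized transition probabilities."""
    def normalized(a, b):
        top = max((p for (o, _), p in probs.items() if o == a), default=0.0)
        return probs.get((a, b), 0.0) / top if top > 0 else 0.0
    return normalized(u, v) * normalized(v, u)


# Illustrative transitions (origin, destination); not real data.
transitions = ([("A", "B")] * 3 + [("A", "C")] +
               [("B", "A")] * 2 + [("B", "C")] * 2 +
               [("C", "A")])
probs = transition_probs(transitions)
score_ab = peer_score(probs, "A", "B")
score_ac = peer_score(probs, "A", "C")
score_bc = peer_score(probs, "B", "C")
```

Here A and B are each other's top destination, so their score is 1, while B and C score 0 because no C-to-B transition appears in the sample.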

For each user of an online service, the user's work experiences can be arranged into a list of organization transitions in time order, and these transitions can be used as positive samples in training. For example, if a user lists consecutive work experiences in companies A, B, and C, then the company A to B transition and the B to C transition are marked as positive in training.
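A minimal sketch of this extraction step, assuming each work experience is available as a hypothetical `(start_year, organization)` record:

```python
def positive_transitions(experiences):
    """Return (origin, destination) pairs from (start_year, organization) records."""
    # Sort by start year so that consecutive entries are consecutive jobs.
    ordered = [org for _, org in sorted(experiences)]
    return list(zip(ordered, ordered[1:]))


pairs = positive_transitions([(2015, "B"), (2010, "A"), (2019, "C")])
```

A user with a single listed experience yields no transitions and therefore contributes no positive samples.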

Negative sampling can also be applied to approximately estimate the transition probability and calculate the defined peer score. Here, each organization u is mapped to its embeddings, including $\phi_u$ in the latent transition origin space $\Phi \subset \mathbb{R}^n$ and $\psi_u$ in the latent transition destination space $\Psi \subset \mathbb{R}^m$. For each organization u, K organizations not sharing any transition with organization u can be randomly retrieved as negative samples, and the transition probability can be calculated as follows:

$P(v \mid u) = \sigma(\phi_u^T \psi_v) \prod_{k=1}^{K} \sigma\left(-\phi_u^T \psi_{w_k(u)}\right),$

where σ(x)=1/(1+exp(−x)) is the sigmoid function, and $N_u := \{w_1(u), w_2(u), \ldots, w_K(u)\}$ denotes the set of K randomly sampled negative companies of company u. The latent space dimension m and the negative sample size K are two parameters to be chosen empirically based on the data size.

With the embeddings, the peer organization score can be approximately computed by randomly marginalizing out the denominator as:

$ps(u,v) \approx \frac{\sigma(\phi_u^T \psi_v)}{\max_{w} \sigma(\phi_u^T \psi_w)} \cdot \frac{\sigma(\phi_v^T \psi_u)}{\max_{w} \sigma(\phi_v^T \psi_w)}.$
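The two embedding-based quantities can be sketched as follows. The dictionaries `phi` and `psi`, the helper names, and the toy vectors are assumptions for illustration. The second factor follows the peer score definition, using P(u|v) normalized over the destinations of v, and excluding an organization from its own maximum is likewise an assumption of this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def transition_prob(phi_u, psi_v, negative_psis):
    """P(v|u) under negative sampling, given the K negative destination vectors."""
    p = sigmoid(dot(phi_u, psi_v))
    for psi_w in negative_psis:
        p *= sigmoid(-dot(phi_u, psi_w))
    return p


def approx_peer_score(phi, psi, u, v):
    """Approximate ps(u, v) from origin (phi) and destination (psi) embeddings."""
    def normalized(a, b):
        # Assumption: the maximum ranges over every destination other than a.
        top = max(sigmoid(dot(phi[a], psi[w])) for w in psi if w != a)
        return sigmoid(dot(phi[a], psi[b])) / top
    return normalized(u, v) * normalized(v, u)


# Toy two-dimensional embeddings; values are illustrative only.
phi = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, -1.0]}
psi = {"A": [0.0, 2.0], "B": [2.0, 0.0], "C": [-1.0, -1.0]}
score_ab = approx_peer_score(phi, psi, "A", "B")
score_ac = approx_peer_score(phi, psi, "A", "C")
```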

The problem can then be interpreted as an optimization problem of learning the latent transition origin embeddings $\phi_C = \{\phi_u : u \in C\}$ and destination embeddings $\psi_C = \{\psi_u : u \in C\}$ for all companies C, with the objective function being the log likelihood over all pairs of transitions T,

$\sum_{(u,v) \in T} \sum_{z \in \{v\} \cup N_u} \left\{ 1_{\{z=v\}} \cdot \log\left[\sigma(\phi_u^T \psi_z)\right] + \left(1 - 1_{\{z=v\}}\right) \cdot \log\left[1 - \sigma(\phi_u^T \psi_z)\right] \right\}.$

This optimization problem can be solved by stochastic gradient descent (SGD), iterating between updating origin and destination embeddings until convergence. With a chosen learning rate η in SGD, the pseudocode for updating the origin embedding ϕ_u for each positive ordered transition pair (u, v), given the destination embeddings of the destination v and all negative samples as input, is shown below:

input: {ψ_z : z ∈ {v} ∪ N_u}, i.e., destination embeddings of the positive destination company v and all negative samples in N_u.
output: ϕ_u, i.e., origin embedding of company u.

procedure UPDATE(ϕ_u)
    e ← 0    * Initiate e to be 0
    for z ∈ {v} ∪ N_u do
        g ← η · [1_{z=v} − σ(ϕ_u^T ψ_z)]
        e ← e + g · ψ_z
        ψ_z ← ψ_z + g · ϕ_u
    end for
    ϕ_u ← ϕ_u + e
end procedure
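The UPDATE procedure can be rendered in Python roughly as follows. This is a sketch rather than a definitive implementation: embeddings are plain lists, `psi` is a hypothetical dictionary from organization name to destination vector, and the accumulated change e is applied to the origin embedding once after the loop, which is the usual arrangement for negative-sampling updates and is assumed here. The `log_likelihood` helper is added only to check that one step moves the objective in the right direction.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update(phi_u, psi, v, negatives, eta):
    """One SGD step for the positive pair (u, v); mutates phi_u and psi in place."""
    e = [0.0] * len(phi_u)                      # accumulated change for phi_u
    for z in [v] + list(negatives):
        label = 1.0 if z == v else 0.0          # indicator 1_{z = v}
        g = eta * (label - sigmoid(dot(phi_u, psi[z])))
        for i in range(len(phi_u)):
            e[i] += g * psi[z][i]               # uses psi_z before it is updated
            psi[z][i] += g * phi_u[i]           # destination embedding update
    for i in range(len(phi_u)):
        phi_u[i] += e[i]                        # origin embedding update


def log_likelihood(phi_u, psi, v, negatives):
    """Objective contribution of one positive pair and its negative samples."""
    total = math.log(sigmoid(dot(phi_u, psi[v])))
    for w in negatives:
        total += math.log(1.0 - sigmoid(dot(phi_u, psi[w])))
    return total


# Toy values for a single step; illustrative only.
phi_u = [0.1, -0.2]
psi = {"v": [0.3, 0.1], "w": [-0.2, 0.4]}
before = log_likelihood(phi_u, psi, "v", ["w"])
update(phi_u, psi, "v", ["w"], eta=0.1)
after = log_likelihood(phi_u, psi, "v", ["w"])
```

With a small learning rate, the log likelihood for the pair increases after the step, which is the expected behavior of a gradient ascent update on this objective.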

As described briefly above, a second aspect may be the formation of peer organization groups using the peer scores. Specifically, in one example embodiment, a matrix of organization-organization peer scores can be generated and matrix factorization, such as singular value decomposition (SVD) or L-U decomposition, may be performed to generate peer organization group clusters. In another example embodiment, for a first organization, peer scores for all other companies are computed and ranked. A number of the top ranked peer companies are then considered to be in the set of peer organizations. This set of top ranked peer companies may be selected based on a preset number of organizations (e.g., the top 5 organizations are selected as peers) or in comparison to a threshold (e.g., all organizations having a peer score above X are selected as peers).
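The ranking-based selection can be sketched as below, where `scores` is a hypothetical mapping from each other organization to its peer score with the first organization, and the parameter names are illustrative:

```python
def select_peers(scores, top_n=None, threshold=None):
    """Rank organizations by peer score, then keep the top N and/or those above a threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(org, s) for org, s in ranked if s > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [org for org, _ in ranked]


# Illustrative peer scores for one organization.
scores = {"B": 0.9, "C": 0.4, "D": 0.7, "E": 0.1}
top_two = select_peers(scores, top_n=2)
above = select_peers(scores, threshold=0.3)
```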

In another example embodiment, organizations are partitioned by a combination of one or more of industry, location, title, or function, and matrix factorization analysis is performed separately over each partition in order to obtain the peer organization groups.

In another example embodiment, unsupervised classification and clustering methods, such as K-means clustering and Gaussian Process classification, are used to generate peer company groups.

In another example embodiment, the appropriate matrix of peer scores is analyzed to generate peer organization groups at the level of (organization, title, region), (organization, title), and (organization, region). Alternatively, (peer organization, super title) groups may be formed by analyzing the matrix of peer scores, and the generated groups are used as parent cohorts to smooth peer organization groups at the title level (i.e., peer organization, title).

These techniques may be applied, either individually or in various combinations, to output clusters of peer organizations or to output specific lists for individual organizations.

FIG. 7 is a flow diagram illustrating a method 700, in accordance with an example embodiment. At operation 702, information about transitions made by employees of a first organization to another organization or from another organization to the first organization is obtained. At operation 704, for each particular organization of a plurality of organizations other than the first organization, a total number of transitions between the particular organization and the first organization is calculated.

At operation 706, for each of a plurality of combinations of the first organization and a different particular organization in the plurality of organizations other than the first organization, a peer score is calculated based on the total number of transitions between the particular organization and the first organization. At operation 708, one or more of the different particular organizations that are part of a peer organization group with the first organization are identified based on the calculated peer scores.

At operation 710, a deidentified set of confidential data values is obtained for the first organization. At operation 712, it is determined whether the deidentified set of confidential data values for the first organization contains fewer values than a predetermined threshold for providing a statistical insight. If not, then at operation 714 a statistical insight can be generated for the first organization based on the deidentified set of confidential data values for the first organization. If so, then at operation 716, one or more confidential data values for the first organization are inferred using one or more confidential data values for one or more organizations, other than the first organization, in the peer organization group. Then at operation 718, a statistical insight is generated based on the inferred one or more confidential data values.

At operation 720, the statistical insight may be displayed in a graphical user interface (GUI) rendered on a display.

FIG. 8 is a screen capture illustrating an organization confidential data social networking profile page 800, in accordance with an example embodiment. Here, the organization confidential data social networking profile page 800 includes one or more statistical insights 802A, 802B, 802C garnered from submitted confidential data as described above. Also displayed is an organization peer group 804 including a list of peer organizations as well as brief statistical insights about each, also garnered from the submitted confidential data.

It should be noted that the statistical insights 802A-802C, as well as the statistical insights in the organization peer group 804, may rely on inferences based on confidential data about other organizations in the organization peer group, as described in more detail earlier, especially in cases where there is a lack of actual confidential data to directly make the corresponding insight. For example, the statistical insights 802A-802C may rely on an inference from confidential data about one or more of the organizations in the organization peer group because there are not enough submitted salary or compensation confidential data values explicitly for company A in order to make the insight.

In some example embodiments, Bayesian smoothing is used on confidential data for peer organizations of an organization of interest, in order to correct for a lack of available pieces of confidential data for the organization of interest itself and/or prevent detection of individual confidential data values by users. This solution may be used whether using the process for determining peer organizations as described above, or alternatively using another process for determining peer organizations.

Specifically, a technical problem arises in scenarios where confidential data values for an organization are sparse or in scenarios where the use of additional confidential data values for an organization presents a security problem in that users may be able to detect added confidential data values. In the latter scenario, the problem arises in situations where the system is designed so that users are not able to view individual confidential data values and are only presented with statistical insights about the confidential data values, such as averages and medians. An example is a case where the confidential data values are deidentified and encrypted, such as with confidential salary and compensation information. While the system may be designed to provide users with, for example, insight into the average salary for a particular title at a particular organization, the system is designed not to share individual salaries. When the number of confidential data values for a particular organization is low, however, not only are such confidential insights of lesser value (since one confidential data value could skew the results), but it is possible for users to determine an individual confidential data value by tracking the change in the statistical insights over time. For example, if a user is presented with a statistical insight that there are 5 confidential salary values for a particular title at a particular organization and the average salary is $100,000, and then later the user is presented with a statistical insight that there are 6 confidential salary values for that same title and organization and the average salary is now $116,666, then the user will be able to calculate that the salary recently submitted was approximately $200,000. This presents a security problem in systems where users who submitted the confidential data values were assured that their submissions would only be used for aggregated statistical insights.
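The arithmetic behind this inference is simple enough to state as a short sketch (the function name is illustrative): the newly added value is the difference between the two implied totals.

```python
def recover_new_value(count_before, mean_before, count_after, mean_after):
    """Recover a single added value from two (count, average) observations."""
    return count_after * mean_after - count_before * mean_before


# Figures from the example above: 5 values averaging $100,000,
# then 6 values averaging $116,666.
leaked = recover_new_value(5, 100_000, 6, 116_666)
```

Using the displayed (truncated) average, the recovered value is $199,996, i.e., the $200,000 submission up to rounding in the published average.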

One solution for this problem would be for the system to identify an ancestral cohort for the cohort in question and generalize the results to that ancestral cohort. For example, if there are not enough data values for Software Engineers working at LinkedIn with 6-10 years of experience in the San Francisco Bay Area, then various ancestral cohorts could be tried until the “best” one that has enough data values can be determined. Thus, for example, the results could be generalized to Software Engineers working at LinkedIn with 4-15 years of experience in the San Francisco Bay Area, Software Engineers working at LinkedIn with 6-10 years of experience in California, or Software Engineers working at any organization with 6-10 years of experience in the San Francisco Bay Area.

This solution, however, is limited in that when generalizing on the organization attribute (e.g., going from LinkedIn to any organization), there is no intermediate level in the hierarchy to which to generalize. The cohort is either at the organization level (LinkedIn) or so general that it includes every organization. This is unlike the other attributes, such as years of experience or location, which can be generalized to an intermediate level (e.g., San Francisco Bay Area can be generalized to California and doesn't necessarily need to be generalized so far that location is irrelevant).

In an example embodiment, the peer organization groups can be used to form an intermediate-level cohort.

The Bayesian model may be used because it provides a flexible structure for incorporating external knowledge in the form of a prior. For an organization c to be studied, its peer organization group may be denoted as pc(c), which contains a list of organizations similar to c. Its prior mean and variance may be set to be centered at

$\hat{\mu}_{pc(c)} = \sum_{c' \in pc(c)} \sum_{i=1:n_{c'}} \tilde{y}_{c',i} / n_{pc(c)};$

$\hat{\sigma}_{pc(c)}^{2} = \sum_{c' \in pc(c)} \sum_{i=1:n_{c'}} \left(\tilde{y}_{c',i} - \hat{\mu}_{pc(c)}\right)^{2} / n_{pc(c)}.$

The prior mean and variance μ₀, σ₀² can also be global information estimated from the confidential data values in the organization set C of all organizations, with $n_C$ taken here, by analogy with $n_{pc(c)}$, as the total number of such values:

$\hat{\mu}_{all} = \sum_{c' \in C} \sum_{i=1:n_{c'}} \tilde{y}_{c',i} / n_C;$

$\hat{\sigma}_{all}^{2} = \sum_{c' \in C} \sum_{i=1:n_{c'}} \left(\tilde{y}_{c',i} - \hat{\mu}_{all}\right)^{2} / n_C.$

In an example embodiment, an organization's prior is chosen to be peer organization information when the number of confidential data values in its peer organization group, $n_{pc(c)} = \sum_{c' \in pc(c)} n_{c'}$, is no smaller than a certain threshold $n_\tau$, and centered at global information otherwise. That is,

$(\mu_0, \sigma_0^2) = \begin{cases} \left(\hat{\mu}_{pc(c)}, \hat{\sigma}_{pc(c)}^{2}\right) & \text{if } n_{pc(c)} \geq n_\tau, \\ \left(\hat{\mu}_{all}, \hat{\sigma}_{all}^{2}\right) & \text{otherwise.} \end{cases}$
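The prior selection rule can be sketched as follows, assuming a hypothetical dictionary `values_by_org` that maps each organization to its list of adjusted data values. The mean and variance are computed with the total value count as the divisor, following the estimators given above.

```python
def mean_var(values):
    """Mean and (population) variance of a non-empty list of values."""
    n = len(values)
    mu = sum(values) / n
    var = sum((y - mu) ** 2 for y in values) / n
    return mu, var


def choose_prior(values_by_org, peers, n_tau):
    """Return (mu_0, sigma_0^2): peer-group estimates if the peer group holds
    at least n_tau values, and global estimates otherwise."""
    peer_values = [y for c in peers for y in values_by_org.get(c, [])]
    if len(peer_values) >= n_tau:
        return mean_var(peer_values)
    all_values = [y for ys in values_by_org.values() for y in ys]
    return mean_var(all_values)


# Illustrative values; organization "A" has peers "B" and "C".
values_by_org = {"A": [100.0], "B": [90.0, 110.0], "C": [80.0, 120.0], "D": [200.0]}
peer_prior = choose_prior(values_by_org, ["B", "C"], n_tau=4)
global_prior = choose_prior(values_by_org, ["B", "C"], n_tau=5)
```

With four peer values available, a threshold of 4 selects the peer-group prior, while a threshold of 5 falls back to the global prior.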

If $D_c = \{\tilde{y}_{c,1}, \ldots, \tilde{y}_{c,n_c}\}$ denotes the set of $n_c$ organization-adjusted data values of organization c, then all data in $D_c$ can be modeled as a normal distribution with a conjugate normal-inverse-Gamma prior as

Model: $\tilde{y}_{c,i} \sim N(\mu_c, \tau^2)$ for $i = 1, \ldots, n_c$,

Priors: $\mu_c \mid \tau^2 \sim N(\mu_0, \tau^2/n_0)$, $\tau^{-2} \sim \mathrm{Gamma}(\eta/\sigma_0^2, \eta)$,

where n₀ = m/δ, and δ and η are two hyper-smoothing parameters that indicate how much information is passed from the prior to the model and that are to be optimized via cross-validation. The smaller δ and η are, the more information is passed from the prior. It should be noted that the prior mean of $\mu_c$ is the same as the external data mean μ₀, while the prior distribution of its precision $\tilde{\tau} := \tau^{-2}$ is also centered at the external precision mean 1/σ₀².
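As a quick check of the last statement, assume the Gamma distribution is parameterized by shape and rate (an assumption; the parameterization is not stated explicitly). A Gamma distribution with shape a and rate b has mean a/b, so with shape η/σ₀² and rate η the prior mean of the precision is

$E\left[\tau^{-2}\right] = \frac{\eta/\sigma_0^2}{\eta} = \frac{1}{\sigma_0^2},$

which is the external precision, regardless of the value chosen for η.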

With the organization mean denoted as $\bar{y}_c$, the posterior can be updated as

$\mu_c \mid \tau^2, D_c \sim N\left(\frac{n_c}{n_c + n_0}\bar{y}_c + \frac{n_0}{n_c + n_0}\mu_0, \; \frac{\tau^2}{n_c + n_0}\right),$

$\tau^{-2} \mid D_c \sim \mathrm{Gamma}\left(\frac{n_c}{2} + \frac{\eta}{\sigma_0^2}, \; \eta + \frac{1}{2}\sum_{i=1}^{n_c}\left(\tilde{y}_i - \bar{y}_c\right)^2 + \frac{n_c n_0}{2(n_c + n_0)}\left(\bar{y}_c - \mu_0\right)^2\right).$

By marginalizing out the mean parameter $\mu_c$ and precision parameter τ⁻², the posterior prediction $\tilde{y}^*_c$ for organization c is thus a t distribution as

$\tilde{y}^*_c \mid D_c \sim t_{df_c}(m_c, s_c),$ where

$df_c = n_c + \frac{2\eta}{\sigma_0^2}, \quad m_c = \frac{n_c}{n_c + n_0}\bar{y}_c + \frac{n_0}{n_c + n_0}\mu_0,$

$s_c = \left(1 + \frac{1}{n_c + n_0}\right) \frac{\eta + \frac{1}{2}\left[\sum_{i=1}^{n_c}\left(\tilde{y}_i - \bar{y}_c\right)^2 + \frac{n_c n_0}{n_c + n_0}\left(\bar{y}_c - \mu_0\right)^2\right]}{\frac{n_c}{2} + \frac{\eta}{\sigma_0^2}}.$

It should be noted that the posterior mean $m_c$ is a weighted sum of the data mean $\bar{y}_c$ and the prior mean μ₀, while the posterior variance is a combination of the data variance $\sum_{i=1}^{n_c}(\tilde{y}_i - \bar{y}_c)^2$ and the departure of the data mean from the prior mean $(\bar{y}_c - \mu_0)^2$.
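The posterior predictive parameters can be computed directly from the formulas above. In this sketch the function and argument names are illustrative, and n₀ is passed in as a number rather than derived from δ.

```python
def posterior_predictive(values, mu0, sigma0_sq, n0, eta):
    """Return (df_c, m_c, s_c) of the t-distributed posterior prediction."""
    n_c = len(values)
    y_bar = sum(values) / n_c
    sum_sq = sum((y - y_bar) ** 2 for y in values)
    shape = n_c / 2 + eta / sigma0_sq                     # posterior Gamma shape
    rate = eta + 0.5 * (sum_sq + n_c * n0 / (n_c + n0) * (y_bar - mu0) ** 2)
    df = n_c + 2 * eta / sigma0_sq                        # equals 2 * shape
    m = n_c / (n_c + n0) * y_bar + n0 / (n_c + n0) * mu0  # shrunk mean
    s = (1 + 1 / (n_c + n0)) * rate / shape
    return df, m, s


# One observed value of 200 shrunk toward a prior mean of 100 with n0 = 3.
df_c, m_c, s_c = posterior_predictive([200.0], mu0=100.0, sigma0_sq=250.0,
                                      n0=3.0, eta=1.0)
```

In the usage shown, the single observation receives weight 1/4 and the prior mean receives weight 3/4, so the posterior mean is 125 rather than 200, which illustrates how a sparse cohort is pulled toward its peer-group (or global) prior.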

FIG. 9 is a flow diagram illustrating a method 900 in accordance with an example embodiment. At operation 902, it is determined if a first deidentified set of confidential data values for a first organization contains fewer values than a predetermined threshold for providing a statistical insight, the first deidentified set of confidential data submitted by users in a cohort having a first value for a first attribute and a second value for a second attribute. If not, then at operation 904, a statistical insight may be calculated using the first deidentified set of confidential data values for the first organization, and the method 900 proceeds to operation 914.

If so, then at operation 906, a second deidentified set of confidential data values for organizations identified as peer organizations for the first organization may be retrieved. The second deidentified set of confidential data values is submitted by users in a cohort having the first value for the first attribute and the second value for the second attribute.

At operation 908, the second deidentified set of confidential data values is modeled using a Bayesian model. At operation 910, one or more confidential data values are inferred for the first organization using the modeled second deidentified set of confidential data values.

At operation 912, a statistical insight is calculated based on the inferred one or more confidential data values.

At operation 914, a graphical user interface is generated containing the statistical insight. Optionally, a histogram may also be generated in the graphical user interface.

FIG. 10 is a screen capture of a graphical user interface 1000 in accordance with an example embodiment. The graphical user interface 1000 may include one or more statistical insights 1002A, 1002B related to the inferred one or more confidential data values. A histogram may also be generated in the graphical user interface 1000 indicating the statistical distribution of the confidential data values. Here, however, an indication 1004 that there are not enough confidential data values to generate the histogram is displayed.

FIG. 11 is a block diagram 1100 illustrating a software architecture 1102, which can be installed on any one or more of the devices described above. FIG. 11 is merely a non-limiting example of a software architecture, and it will be appreciated that many other architectures can be implemented to facilitate the functionality described herein. In various embodiments, the software architecture 1102 is implemented by hardware such as a machine 1200 of FIG. 12 that includes processors 1210, memory 1230, and input/output (I/O) components 1250. In this example architecture, the software architecture 1102 can be conceptualized as a stack of layers where each layer may provide a particular functionality. For example, the software architecture 1102 includes layers such as an operating system 1104, libraries 1106, frameworks 1108, and applications 1110. Operationally, the applications 1110 invoke API calls 1112 through the software stack and receive messages 1114 in response to the API calls 1112, consistent with some embodiments.

In various implementations, the operating system 1104 manages hardware resources and provides common services. The operating system 1104 includes, for example, a kernel 1120, services 1122, and drivers 1124. The kernel 1120 acts as an abstraction layer between the hardware and the other software layers, consistent with some embodiments. For example, the kernel 1120 provides memory management, processor management (e.g., scheduling), component management, networking, and security settings, among other functionality. The services 1122 can provide other common services for the other software layers. The drivers 1124 are responsible for controlling or interfacing with the underlying hardware, according to some embodiments. For instance, the drivers 1124 can include display drivers, camera drivers, BLUETOOTH® or BLUETOOTH® Low Energy drivers, flash memory drivers, serial communication drivers (e.g., Universal Serial Bus (USB) drivers), Wi-Fi® drivers, audio drivers, power management drivers, and so forth.

In some embodiments, the libraries 1106 provide a low-level common infrastructure utilized by the applications 1110. The libraries 1106 can include system libraries 1130 (e.g., C standard library) that can provide functions such as memory allocation functions, string manipulation functions, mathematic functions, and the like. In addition, the libraries 1106 can include API libraries 1132 such as media libraries (e.g., libraries to support presentation and manipulation of various media formats such as Moving Picture Experts Group-4 (MPEG4), Advanced Video Coding (H.264 or AVC), Moving Picture Experts Group Layer-3 (MP3), Advanced Audio Coding (AAC), Adaptive Multi-Rate (AMR) audio codec, Joint Photographic Experts Group (JPEG or JPG), or Portable Network Graphics (PNG)), graphics libraries (e.g., an OpenGL framework used to render in two dimensions (2D) and three dimensions (3D) in a graphic context on a display), database libraries (e.g., SQLite to provide various relational database functions), web libraries (e.g., WebKit to provide web browsing functionality), and the like. The libraries 1106 can also include a wide variety of other libraries 1134 to provide many other APIs to the applications 1110.

The frameworks 1108 provide a high-level common infrastructure that can be utilized by the applications 1110, according to some embodiments. For example, the frameworks 1108 provide various graphic user interface (GUI) functions, high-level resource management, high-level location services, and so forth. The frameworks 1108 can provide a broad spectrum of other APIs that can be utilized by the applications 1110, some of which may be specific to a particular operating system 1104 or platform.

In an example embodiment, the applications 1110 include a home application 1150, a contacts application 1152, a browser application 1154, a book reader application 1156, a location application 1158, a media application 1160, a messaging application 1162, a game application 1164, and a broad assortment of other applications such as a third-party application 1166. According to some embodiments, the applications 1110 are programs that execute functions defined in the programs. Various programming languages can be employed to create one or more of the applications 1110, structured in a variety of manners, such as object-oriented programming languages (e.g., Objective-C, Java, or C++) or procedural programming languages (e.g., C or assembly language). In a specific example, the third-party application 1166 (e.g., an application developed using the ANDROID™ or IOS™ software development kit (SDK) by an entity other than the vendor of the particular platform) may be mobile software running on a mobile operating system such as IOS™, ANDROID™, WINDOWS® Phone, or another mobile operating system. In this example, the third-party application 1166 can invoke the API calls 1112 provided by the operating system 1104 to facilitate functionality described herein.

FIG. 12 illustrates a diagrammatic representation of a machine 1200 in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, according to an example embodiment. Specifically, FIG. 12 shows a diagrammatic representation of the machine 1200 in the example form of a computer system, within which instructions 1216 (e.g., software, a program, an application 1110, an applet, an app, or other executable code) for causing the machine 1200 to perform any one or more of the methodologies discussed herein may be executed. For example, the instructions 1216 may cause the machine 1200 to execute the method 600 of FIG. 6. Additionally, or alternatively, the instructions 1216 may implement FIGS. 1-10, and so forth. The instructions 1216 transform the general, non-programmed machine 1200 into a particular machine 1200 programmed to carry out the described and illustrated functions in the manner described. In alternative embodiments, the machine 1200 operates as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 1200 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 1200 may comprise, but not be limited to, a server computer, a client computer, a PC, a tablet computer, a laptop computer, a netbook, a set-top box (STB), a portable digital assistant (PDA), an entertainment media system, a cellular telephone, a smartphone, a mobile device, a wearable device (e.g., a smart watch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 1216, sequentially or otherwise, that specify actions to be taken by the machine 1200. Further, while only a single machine 1200 is illustrated, the term “machine” shall also be taken to include a collection of machines 1200 that individually or jointly execute the instructions 1216 to perform any one or more of the methodologies discussed herein.

The machine 1200 may include processors 1210, memory 1230, and I/O components 1250, which may be configured to communicate with each other such as via a bus 1202. In an example embodiment, the processors 1210 (e.g., a central processing unit (CPU), a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a radio-frequency integrated circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, a processor 1212 and a processor 1214 that may execute the instructions 1216. The term “processor” is intended to include multi-core processors that may comprise two or more independent processors (sometimes referred to as “cores”) that may execute instructions 1216 contemporaneously. Although FIG. 12 shows multiple processors 1210, the machine 1200 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.

The memory 1230 may include a main memory 1232, a static memory 1234, and a storage unit 1236, all accessible to the processors 1210 such as via the bus 1202. The main memory 1232, the static memory 1234, and the storage unit 1236 store the instructions 1216 embodying any one or more of the methodologies or functions described herein. The instructions 1216 may also reside, completely or partially, within the main memory 1232, within the static memory 1234, within the storage unit 1236, within at least one of the processors 1210 (e.g., within the processor's cache memory), or any suitable combination thereof, during execution thereof by the machine 1200.

The I/O components 1250 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 1250 that are included in a particular machine 1200 will depend on the type of machine 1200. For example, portable machines such as mobile phones will likely include a touch input device or other such input mechanisms, while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 1250 may include many other components that are not shown in FIG. 12. The I/O components 1250 are grouped according to functionality merely for simplifying the following discussion, and the grouping is in no way limiting. In various example embodiments, the I/O components 1250 may include output components 1252 and input components 1254. The output components 1252 may include visual components (e.g., a display such as a plasma display panel (PDP), a light-emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibratory motor, resistance mechanisms), other signal generators, and so forth. The input components 1254 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or another pointing instrument), tactile input components (e.g., a physical button, a touch screen that provides location and/or force of touches or touch gestures, or other tactile input components), audio input components (e.g., a microphone), and the like.

In further example embodiments, the I/O components 1250 may include biometric components 1256, motion components 1258, environmental components 1260, or position components 1262, among a wide array of other components. For example, the biometric components 1256 may include components to detect expressions (e.g., hand expressions, facial expressions, vocal expressions, body gestures, or eye tracking), measure biosignals (e.g., blood pressure, heart rate, body temperature, perspiration, or brain waves), identify a person (e.g., voice identification, retinal identification, facial identification, fingerprint identification, or electroencephalogram-based identification), and the like. The motion components 1258 may include acceleration sensor components (e.g., accelerometer), gravitation sensor components, rotation sensor components (e.g., gyroscope), and so forth. The environmental components 1260 may include, for example, illumination sensor components (e.g., photometer), temperature sensor components (e.g., one or more thermometers that detect ambient temperature), humidity sensor components, pressure sensor components (e.g., barometer), acoustic sensor components (e.g., one or more microphones that detect background noise), proximity sensor components (e.g., infrared sensors that detect nearby objects), gas sensors (e.g., gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere), or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment. The position components 1262 may include location sensor components (e.g., a Global Positioning System (GPS) receiver component), altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude may be derived), orientation sensor components (e.g., magnetometers), and the like.

Communication may be implemented using a wide variety of technologies. The I/O components 1250 may include communication components 1264 operable to couple the machine 1200 to a network 1280 or devices 1270 via a coupling 1282 and a coupling 1272, respectively. For example, the communication components 1264 may include a network interface component or another suitable device to interface with the network 1280. In further examples, the communication components 1264 may include wired communication components, wireless communication components, cellular communication components, near field communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components to provide communication via other modalities. The devices 1270 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a USB).

Moreover, the communication components 1264 may detect identifiers or include components operable to detect identifiers. For example, the communication components 1264 may include radio frequency identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor to detect one-dimensional bar codes such as Universal Product Code (UPC) bar code, multi-dimensional bar codes such as Quick Response (QR) code, Aztec code, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D bar code, and other optical codes), or acoustic detection components (e.g., microphones to identify tagged audio signals). In addition, a variety of information may be derived via the communication components 1264, such as location via Internet Protocol (IP) geolocation, location via Wi-Fi® signal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.

Executable Instructions and Machine Storage Medium

The various memories (i.e., 1230, 1232, 1234, and/or memory of the processor(s) 1210) and/or the storage unit 1236 may store one or more sets of instructions 1216 and data structures (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. These instructions (e.g., the instructions 1216), when executed by the processor(s) 1210, cause various operations to implement the disclosed embodiments.

As used herein, the terms “machine-storage medium,” “device-storage medium,” and “computer-storage medium” mean the same thing and may be used interchangeably. The terms refer to a single or multiple storage devices and/or media (e.g., a centralized or distributed database, and/or associated caches and servers) that store executable instructions 1216 and/or data. The terms shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media, including memory internal or external to the processors 1210. Specific examples of machine-storage media, computer-storage media, and/or device-storage media include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), field-programmable gate array (FPGA), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms “machine-storage media,” “computer-storage media,” and “device-storage media” specifically exclude carrier waves, modulated data signals, and other such media, at least some of which are covered under the term “signal medium” discussed below.

Transmission Medium

In various example embodiments, one or more portions of the network 1280 may be an ad hoc network, an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a WWAN, a MAN, the Internet, a portion of the Internet, a portion of the PSTN, a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks. For example, the network 1280 or a portion of the network 1280 may include a wireless or cellular network, and the coupling 1282 may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or another type of cellular or wireless coupling. In this example, the coupling 1282 may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1×RTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High-Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long-Term Evolution (LTE) standard, others defined by various standard-setting organizations, other long-range protocols, or other data-transfer technology.

The instructions 1216 may be transmitted or received over the network 1280 using a transmission medium via a network interface device (e.g., a network interface component included in the communication components 1264) and utilizing any one of a number of well-known transfer protocols (e.g., HTTP). Similarly, the instructions 1216 may be transmitted or received using a transmission medium via the coupling 1272 (e.g., a peer-to-peer coupling) to the devices 1270. The terms “transmission medium” and “signal medium” mean the same thing and may be used interchangeably in this disclosure. The terms “transmission medium” and “signal medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying the instructions 1216 for execution by the machine 1200, and include digital or analog communications signals or other intangible media to facilitate communication of such software. Hence, the terms “transmission medium” and “signal medium” shall be taken to include any form of modulated data signal, carrier wave, and so forth. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.

Computer-Readable Medium

The terms “machine-readable medium,” “computer-readable medium,” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure. The terms are defined to include both machine-storage media and transmission media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals.

What is claimed is:
 1. A system comprising: a computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the system to: obtain, via one or more graphical user interfaces, a plurality of electronic data submissions, each electronic data submission having user identification information and a corresponding confidential data value; store each of the electronic data submissions in a row of a submission table by encrypting the user identification information using a first cryptographic key and storing the encrypted user identification information in a first column of the submission table, and encrypting the confidential data value using a second cryptographic key different than the first cryptographic key and storing the encrypted confidential data value in a second column of the submission table, thereby deidentifying the confidential data values; determine that a first deidentified set of confidential data values for a first organization contains fewer values than a predetermined threshold for providing a statistical insight, the first deidentified set of confidential data values submitted by users in a cohort having a first value for a first attribute and a second value for a second attribute; in response to the determining: retrieve a second deidentified set of confidential data values for organizations identified as peer organizations for the first organization, the second deidentified set of confidential data values submitted by users in a cohort having the first value for the first attribute and the second value for the second attribute; model the second deidentified set of confidential data values using a Bayesian model; infer one or more confidential data values for the first organization using the modeled second deidentified set of confidential data values; calculate a statistical insight based on the inferred one or more confidential data values; and cause display of the statistical insight in a graphical user interface rendered on a display.
 2. The system of claim 1, wherein the first attribute is title and the second attribute is region.
 3. The system of claim 1, wherein the confidential data values are compensation values for employment.
 4. The system of claim 1, wherein the organizations are identified as peer organizations for the first organization by: obtaining, for the first organization, information about transitions made by employees of the first organization to another organization or from another organization to the first organization; calculating, for each particular organization of a plurality of organizations other than the first organization, a total number of transitions between the particular organization and the first organization; for each of a plurality of combinations of the first organization and a different particular organization in the plurality of organizations other than the first organization, calculating a peer score based on the total number of transitions between the particular organization and the first organization; and identifying one or more of the different particular organizations that are part of a peer organization group with the first organization based on the calculated peer scores.
 5. The system of claim 4, wherein the calculating the peer score includes using negative sampling by randomly retrieving information about organizations not having any employees with transitions to or from the first organization.
 6. The system of claim 4, wherein the identifying comprises: generating a matrix of organization-organization peer scores; and using matrix factorization to generate organization group clusters.
 7. The system of claim 1, wherein the instructions further cause the system to generate a graphical user interface containing the statistical insight as well as a graphic depicting a histogram related to the statistical insight.
 8. A computerized method, comprising: obtaining, via one or more graphical user interfaces, a plurality of electronic data submissions, each electronic data submission having user identification information and a corresponding confidential data value; storing each of the electronic data submissions in a row of a submission table by encrypting the user identification information using a first cryptographic key and storing the encrypted user identification information in a first column of the submission table, and encrypting the confidential data value using a second cryptographic key different than the first cryptographic key and storing the encrypted confidential data value in a second column of the submission table, thereby deidentifying the confidential data values; determining that a first deidentified set of confidential data values for a first organization contains fewer values than a predetermined threshold for providing a statistical insight, the first deidentified set of confidential data values submitted by users in a cohort having a first value for a first attribute and a second value for a second attribute; in response to the determining: retrieving a second deidentified set of confidential data values for organizations identified as peer organizations for the first organization, the second deidentified set of confidential data values submitted by users in a cohort having the first value for the first attribute and the second value for the second attribute; modeling the second deidentified set of confidential data values using a Bayesian model; inferring one or more confidential data values for the first organization using the modeled second deidentified set of confidential data values; calculating a statistical insight based on the inferred one or more confidential data values; and causing display of the statistical insight in a graphical user interface rendered on a display.
 9. The method of claim 8, wherein the first attribute is title and the second attribute is region.
 10. The method of claim 8, wherein the confidential data values are compensation values for employment.
 11. The method of claim 8, wherein the organizations are identified as peer organizations for the first organization by: obtaining, for the first organization, information about transitions made by employees of the first organization to another organization or from another organization to the first organization; calculating, for each particular organization of a plurality of organizations other than the first organization, a total number of transitions between the particular organization and the first organization; for each of a plurality of combinations of the first organization and a different particular organization in the plurality of organizations other than the first organization, calculating a peer score based on the total number of transitions between the particular organization and the first organization; and identifying one or more of the different particular organizations that are part of a peer organization group with the first organization based on the calculated peer scores.
 12. The method of claim 11, wherein the calculating the peer score includes using negative sampling by randomly retrieving information about organizations not having any employees with transitions to or from the first organization.
 13. The method of claim 11, wherein the identifying comprises: generating a matrix of organization-organization peer scores; and using matrix factorization to generate organization group clusters.
 14. The method of claim 8, further comprising generating a graphical user interface containing the statistical insight as well as a graphic depicting a histogram related to the statistical insight.
 15. A non-transitory machine-readable storage medium comprising instructions, which when implemented by one or more machines, cause the one or more machines to perform operations comprising: obtaining, via one or more graphical user interfaces, a plurality of electronic data submissions, each electronic data submission having user identification information and a corresponding confidential data value; storing each of the electronic data submissions in a row of a submission table by encrypting the user identification information using a first cryptographic key and storing the encrypted user identification information in a first column of the submission table, and encrypting the confidential data value using a second cryptographic key different than the first cryptographic key and storing the encrypted confidential data value in a second column of the submission table, thereby deidentifying the confidential data values; determining that a first deidentified set of confidential data values for a first organization contains fewer values than a predetermined threshold for providing a statistical insight, the first deidentified set of confidential data values submitted by users in a cohort having a first value for a first attribute and a second value for a second attribute; in response to the determining: retrieving a second deidentified set of confidential data values for organizations identified as peer organizations for the first organization, the second deidentified set of confidential data values submitted by users in a cohort having the first value for the first attribute and the second value for the second attribute; modeling the second deidentified set of confidential data values using a Bayesian model; inferring one or more confidential data values for the first organization using the modeled second deidentified set of confidential data values; calculating a statistical insight based on the inferred one or more confidential data values; and causing display of the statistical insight in a graphical user interface rendered on a display.
 16. The non-transitory machine-readable storage medium of claim 15, wherein the first attribute is title and the second attribute is region.
 17. The non-transitory machine-readable storage medium of claim 15, wherein the confidential data values are compensation values for employment.
 18. The non-transitory machine-readable storage medium of claim 15, wherein the organizations are identified as peer organizations for the first organization by: obtaining, for the first organization, information about transitions made by employees of the first organization to another organization or from another organization to the first organization; calculating, for each particular organization of a plurality of organizations other than the first organization, a total number of transitions between the particular organization and the first organization; for each of a plurality of combinations of the first organization and a different particular organization in the plurality of organizations other than the first organization, calculating a peer score based on the total number of transitions between the particular organization and the first organization; and identifying one or more of the different particular organizations that are part of a peer organization group with the first organization based on the calculated peer scores.
 19. The non-transitory machine-readable storage medium of claim 18, wherein the calculating the peer score includes using negative sampling by randomly retrieving information about organizations not having any employees with transitions to or from the first organization.
 20. The non-transitory machine-readable storage medium of claim 18, wherein the identifying comprises: generating a matrix of organization-organization peer scores; and using matrix factorization to generate organization group clusters.
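
By way of non-limiting illustration only, the threshold determination and Bayesian inference operations recited above may be sketched as follows. The sketch is not the claimed implementation. It assumes a normal likelihood on log-transformed values with a variance estimated from the peer organizations, and a conjugate normal prior centered on the peer mean. The function names, the threshold of 5, and the prior_strength parameter are hypothetical and do not appear in the disclosure.

```python
import math
from statistics import mean, pvariance

# Hypothetical threshold; the claims recite only "a predetermined threshold".
MIN_VALUES = 5


def needs_peer_smoothing(org_values, threshold=MIN_VALUES):
    """Return True when the organization has fewer values than the threshold."""
    return len(org_values) < threshold


def smooth_with_peers(org_values, peer_values, prior_strength=1.0):
    """Posterior mean and variance of the organization's mean log-value.

    Assumed model (an illustration, not taken from the disclosure):
      log-values ~ Normal(mu, s2), with s2 estimated from the peer values;
      prior on mu ~ Normal(peer mean, s2 / prior_strength).
    """
    if len(peer_values) < 2:
        raise ValueError("need at least two peer values to estimate a variance")
    peer_logs = [math.log(v) for v in peer_values]
    prior_mean = mean(peer_logs)
    noise_var = pvariance(peer_logs)
    if noise_var == 0.0:
        raise ValueError("peer values are identical; variance is zero")

    org_logs = [math.log(v) for v in org_values]
    n = len(org_logs)
    # Conjugate normal-normal update with known variance:
    # the prior acts like `prior_strength` pseudo-observations at the peer mean.
    post_mean = (prior_strength * prior_mean + sum(org_logs)) / (prior_strength + n)
    post_var = noise_var / (prior_strength + n)
    return post_mean, post_var


if __name__ == "__main__":
    # Made-up values for demonstration only.
    org = [150000.0, 170000.0]
    peers = [110000.0, 125000.0, 130000.0, 142000.0, 160000.0, 175000.0]
    if needs_peer_smoothing(org):
        m, v = smooth_with_peers(org, peers)
        print("smoothed estimate:", round(math.exp(m)))
```

In this sketch, an organization with few submitted values is pulled toward the peer group, and the pull weakens as the organization's own count grows. A statistical insight such as a median estimate could then be derived from the returned posterior.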